{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text",
    "id": "jBasof3bv1LB"
   },
   "source": [
    "<h1><center>How to export 🤗 Transformers Models to ONNX ?<h1><center>"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "[ONNX](http://onnx.ai/) is open format for machine learning models. It allows you to save your neural network's computation graph in a framework agnostic way, which might be particularly helpful when deploying deep learning models.\n",
    "\n",
    "Indeed, businesses might have other requirements _(languages, hardware, ...)_ for which the training framework might not be the best suited in inference scenarios. In that context, having a representation of the actual computation graph that can be shared accross various business units and logics across an organization might be a desirable component.\n",
    "\n",
    "Along with the serialization format, ONNX also provides a runtime library which allows efficient and hardware specific execution of the ONNX graph. This is done through the [onnxruntime](https://microsoft.github.io/onnxruntime/) project and already includes collaborations with many hardware vendors to seamlessly deploy models on various platforms.\n",
    "\n",
    "Through this notebook we'll walk you through the process to convert a PyTorch or TensorFlow transformers model to the [ONNX](http://onnx.ai/) and leverage [onnxruntime](https://microsoft.github.io/onnxruntime/) to run inference tasks on models from  🤗 __transformers__"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text",
    "id": "yNnbrSg-5e1s"
   },
   "source": [
    "## Exporting 🤗 transformers model to ONNX\n",
    "\n",
    "---\n",
    "\n",
    "Exporting models _(either PyTorch or TensorFlow)_ is easily achieved through the conversion tool provided as part of 🤗 __transformers__ repository. \n",
    "\n",
    "Under the hood the process is sensibly the following: \n",
    "\n",
    "1. Allocate the model from transformers (**PyTorch or TensorFlow**)\n",
    "2. Forward dummy inputs through the model this way **ONNX** can record the set of operations executed\n",
    "3. Optionally define dynamic axes on input and output tensors\n",
    "4. Save the graph along with the network parameters"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import sys\n",
    "!{sys.executable} -m pip install --upgrade git+https://github.com/huggingface/transformers\n",
    "!{sys.executable} -m pip install --upgrade torch==1.6.0+cpu torchvision==0.7.0+cpu -f https://download.pytorch.org/whl/torch_stable.html\n",
    "!{sys.executable} -m pip install --upgrade onnxruntime==1.4.0\n",
    "!{sys.executable} -m pip install -i https://test.pypi.org/simple/ ort-nightly\n",
    "!{sys.executable} -m pip install --upgrade onnxruntime-tools"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "colab": {},
    "colab_type": "code",
    "id": "PwAaOchY4N2-"
   },
   "outputs": [],
   "source": [
    "!rm -rf onnx/\n",
    "from pathlib import Path\n",
    "from transformers.convert_graph_to_onnx import convert\n",
    "\n",
    "# Handles all the above steps for you\n",
    "convert(framework=\"pt\", model=\"bert-base-cased\", output=Path(\"onnx/bert-base-cased.onnx\"), opset=11)\n",
    "\n",
    "# Tensorflow \n",
    "# convert(framework=\"tf\", model=\"bert-base-cased\", output=\"onnx/bert-base-cased.onnx\", opset=11)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## How to leverage runtime for inference over an ONNX graph\n",
    "\n",
    "---\n",
    "\n",
    "As mentionned in the introduction, **ONNX** is a serialization format and many side projects can load the saved graph and run the actual computations from it. Here, we'll focus on the official [onnxruntime](https://microsoft.github.io/onnxruntime/). The runtime is implemented in C++ for performance reasons and provides API/Bindings for C++, C, C#, Java and Python.\n",
    "\n",
    "In the case of this notebook, we will use the Python API to highlight how to load a serialized **ONNX** graph and run inference workload on various backends through **onnxruntime**.\n",
    "\n",
    "**onnxruntime** is available on pypi:\n",
    "\n",
    "- onnxruntime: ONNX + MLAS (Microsoft Linear Algebra Subprograms)\n",
    "- onnxruntime-gpu: ONNX + MLAS + CUDA\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "pycharm": {
     "name": "#%%\n"
    }
   },
   "outputs": [],
   "source": [
    "!pip install transformers onnxruntime-gpu onnx psutil matplotlib"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text",
    "id": "-gP08tHfBvgY"
   },
   "source": [
    "## Preparing for an Inference Session\n",
    "\n",
    "---\n",
    "\n",
    "Inference is done using a specific backend definition which turns on hardware specific optimizations of the graph. \n",
    "\n",
    "Optimizations are basically of three kinds: \n",
    "\n",
    "- **Constant Folding**: Convert static variables to constants in the graph \n",
    "- **Deadcode Elimination**: Remove nodes never accessed in the graph\n",
    "- **Operator Fusing**: Merge multiple instruction into one (Linear -> ReLU can be fused to be LinearReLU)\n",
    "\n",
    "ONNX Runtime automatically applies most optimizations by setting specific `SessionOptions`.\n",
    "\n",
    "Note:Some of the latest optimizations that are not yet integrated into ONNX Runtime are available in [optimization script](https://github.com/microsoft/onnxruntime/tree/master/onnxruntime/python/tools/transformers) that tunes models for the best performance."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "pycharm": {
     "name": "#%%\n"
    }
   },
   "outputs": [],
   "source": [
    "# # An optional step unless\n",
    "# # you want to get a model with mixed precision for perf accelartion on newer GPU\n",
    "# # or you are working with Tensorflow(tf.keras) models or pytorch models other than bert\n",
    "\n",
    "# !pip install onnxruntime-tools\n",
    "# from onnxruntime_tools import optimizer\n",
    "\n",
    "# # Mixed precision conversion for bert-base-cased model converted from Pytorch\n",
    "# optimized_model = optimizer.optimize_model(\"bert-base-cased.onnx\", model_type='bert', num_heads=12, hidden_size=768)\n",
    "# optimized_model.convert_model_float32_to_float16()\n",
    "# optimized_model.save_model_to_file(\"bert-base-cased.onnx\")\n",
    "\n",
    "# # optimizations for bert-base-cased model converted from Tensorflow(tf.keras)\n",
    "# optimized_model = optimizer.optimize_model(\"bert-base-cased.onnx\", model_type='bert_keras', num_heads=12, hidden_size=768)\n",
    "# optimized_model.save_model_to_file(\"bert-base-cased.onnx\")\n",
    "\n",
    "\n",
    "# optimize transformer-based models with onnxruntime-tools\n",
    "from onnxruntime_tools import optimizer\n",
    "from onnxruntime_tools.transformers.onnx_model_bert import BertOptimizationOptions\n",
    "\n",
    "# disable embedding layer norm optimization for better model size reduction\n",
    "opt_options = BertOptimizationOptions('bert')\n",
    "opt_options.enable_embed_layer_norm = False\n",
    "\n",
    "opt_model = optimizer.optimize_model(\n",
    "    'onnx/bert-base-cased.onnx',\n",
    "    'bert', \n",
    "    num_heads=12,\n",
    "    hidden_size=768,\n",
    "    optimization_options=opt_options)\n",
    "opt_model.save_model_to_file('bert.opt.onnx')\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "pycharm": {
     "name": "#%%\n"
    }
   },
   "outputs": [],
   "source": [
    "from os import environ\n",
    "from psutil import cpu_count\n",
    "\n",
    "# Constants from the performance optimization available in onnxruntime\n",
    "# It needs to be done before importing onnxruntime\n",
    "environ[\"OMP_NUM_THREADS\"] = str(cpu_count(logical=True))\n",
    "environ[\"OMP_WAIT_POLICY\"] = 'ACTIVE'\n",
    "\n",
    "from onnxruntime import GraphOptimizationLevel, InferenceSession, SessionOptions, get_all_providers"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "colab": {},
    "colab_type": "code",
    "id": "2k-jHLfdcTFS"
   },
   "outputs": [],
   "source": [
    "from contextlib import contextmanager\n",
    "from dataclasses import dataclass\n",
    "from time import time\n",
    "from tqdm import trange\n",
    "\n",
    "def create_model_for_provider(model_path: str, provider: str) -> InferenceSession: \n",
    "  \n",
    "  assert provider in get_all_providers(), f\"provider {provider} not found, {get_all_providers()}\"\n",
    "\n",
    "  # Few properties that might have an impact on performances (provided by MS)\n",
    "  options = SessionOptions()\n",
    "  options.intra_op_num_threads = 1\n",
    "  options.graph_optimization_level = GraphOptimizationLevel.ORT_ENABLE_ALL\n",
    "\n",
    "  # Load the model as a graph and prepare the CPU backend \n",
    "  session = InferenceSession(model_path, options, providers=[provider])\n",
    "  session.disable_fallback()\n",
    "    \n",
    "  return session\n",
    "\n",
    "\n",
    "@contextmanager\n",
    "def track_infer_time(buffer: [int]):\n",
    "    start = time()\n",
    "    yield\n",
    "    end = time()\n",
    "\n",
    "    buffer.append(end - start)\n",
    "\n",
    "\n",
    "@dataclass\n",
    "class OnnxInferenceResult:\n",
    "  model_inference_time: [int]  \n",
    "  optimized_model_path: str"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text",
    "id": "teJdG3amE-hR"
   },
   "source": [
    "## Forwarding through our optimized ONNX model running on CPU\n",
    "\n",
    "---\n",
    "\n",
    "When the model is loaded for inference over a specific provider, for instance **CPUExecutionProvider** as above, an optimized graph can be saved. This graph will might include various optimizations, and you might be able to see some **higher-level** operations in the graph _(through [Netron](https://github.com/lutzroeder/Netron) for instance)_ such as:\n",
    "- **EmbedLayerNormalization**\n",
    "- **Attention**\n",
    "- **FastGeLU**\n",
    "\n",
    "These operations are an example of the kind of optimization **onnxruntime** is doing, for instance here gathering multiple operations into bigger one _(Operator Fusing)_."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/",
     "height": 34
    },
    "colab_type": "code",
    "id": "dmC22kJfVGYe",
    "outputId": "f3aba5dc-15c0-4f82-b38c-1bbae1bf112e"
   },
   "outputs": [],
   "source": [
    "from transformers import BertTokenizerFast\n",
    "\n",
    "tokenizer = BertTokenizerFast.from_pretrained(\"bert-base-cased\")\n",
    "cpu_model = create_model_for_provider(\"onnx/bert-base-cased.onnx\", \"CPUExecutionProvider\")\n",
    "\n",
    "# Inputs are provided through numpy array\n",
    "model_inputs = tokenizer(\"My name is Bert\", return_tensors=\"pt\")\n",
    "inputs_onnx = {k: v.cpu().detach().numpy() for k, v in model_inputs.items()}\n",
    "\n",
    "# Run the model (None = get all the outputs)\n",
    "sequence, pooled = cpu_model.run(None, inputs_onnx)\n",
    "\n",
    "# Print information about outputs\n",
    "\n",
    "print(f\"Sequence output: {sequence.shape}, Pooled output: {pooled.shape}\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Benchmarking PyTorch model\n",
    "\n",
    "_Note: PyTorch model benchmark is run on CPU_"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/",
     "height": 51
    },
    "colab_type": "code",
    "id": "PS_49goe197g",
    "outputId": "0ef0f70c-f5a7-46a0-949a-1a93f231d193"
   },
   "outputs": [],
   "source": [
    "from transformers import BertModel\n",
    "\n",
    "PROVIDERS = {\n",
    "    (\"cpu\", \"PyTorch CPU\"),\n",
    "#  Uncomment this line to enable GPU benchmarking\n",
    "#    (\"cuda:0\", \"PyTorch GPU\")\n",
    "}\n",
    "\n",
    "results = {}\n",
    "\n",
    "for device, label in PROVIDERS:\n",
    "    \n",
    "    # Move inputs to the correct device\n",
    "    model_inputs_on_device = {\n",
    "        arg_name: tensor.to(device)\n",
    "        for arg_name, tensor in model_inputs.items()\n",
    "    }\n",
    "\n",
    "    # Add PyTorch to the providers\n",
    "    model_pt = BertModel.from_pretrained(\"bert-base-cased\").to(device)\n",
    "    for _ in trange(10, desc=\"Warming up\"):\n",
    "      model_pt(**model_inputs_on_device)\n",
    "\n",
    "    # Compute \n",
    "    time_buffer = []\n",
    "    for _ in trange(100, desc=f\"Tracking inference time on PyTorch\"):\n",
    "      with track_infer_time(time_buffer):\n",
    "        model_pt(**model_inputs_on_device)\n",
    "\n",
    "    # Store the result\n",
    "    results[label] = OnnxInferenceResult(\n",
    "        time_buffer, \n",
    "        None\n",
    "    ) "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text",
    "id": "Kda1e7TkEqNR"
   },
   "source": [
    "## Benchmarking PyTorch & ONNX on CPU\n",
    "\n",
    "_**Disclamer: results may vary from the actual hardware used to run the model**_"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/",
     "height": 170
    },
    "colab_type": "code",
    "id": "WcdFZCvImVig",
    "outputId": "bfd779a1-0bc7-42db-8587-e52a485ec5e3"
   },
   "outputs": [],
   "source": [
    "PROVIDERS = {\n",
    "    (\"CPUExecutionProvider\", \"ONNX CPU\"),\n",
    "#  Uncomment this line to enable GPU benchmarking\n",
    "#     (\"CUDAExecutionProvider\", \"ONNX GPU\")\n",
    "}\n",
    "\n",
    "\n",
    "for provider, label in PROVIDERS:\n",
    "    # Create the model with the specified provider\n",
    "    model = create_model_for_provider(\"onnx/bert-base-cased.onnx\", provider)\n",
    "\n",
    "    # Keep track of the inference time\n",
    "    time_buffer = []\n",
    "\n",
    "    # Warm up the model\n",
    "    model.run(None, inputs_onnx)\n",
    "\n",
    "    # Compute \n",
    "    for _ in trange(100, desc=f\"Tracking inference time on {provider}\"):\n",
    "      with track_infer_time(time_buffer):\n",
    "          model.run(None, inputs_onnx)\n",
    "\n",
    "    # Store the result\n",
    "    results[label] = OnnxInferenceResult(\n",
    "      time_buffer,\n",
    "      model.get_session_options().optimized_model_filepath\n",
    "    )"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "%matplotlib inline\n",
    "\n",
    "import matplotlib\n",
    "import matplotlib.pyplot as plt\n",
    "import numpy as np\n",
    "import os\n",
    "\n",
    "\n",
    "# Compute average inference time + std\n",
    "time_results = {k: np.mean(v.model_inference_time) * 1e3 for k, v in results.items()}\n",
    "time_results_std = np.std([v.model_inference_time for v in results.values()]) * 1000\n",
    "\n",
    "plt.rcdefaults()\n",
    "fig, ax = plt.subplots(figsize=(16, 12))\n",
    "ax.set_ylabel(\"Avg Inference time (ms)\")\n",
    "ax.set_title(\"Average inference time (ms) for each provider\")\n",
    "ax.bar(time_results.keys(), time_results.values(), yerr=time_results_std)\n",
    "plt.show()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Quantization support from transformers\n",
    "\n",
    "Quantization enables the use of integers (_instead of floatting point_) arithmetic to run neural networks models faster. From a high-level point of view, quantization works as mapping the float32 ranges of values as int8 with the less loss in the performances of the model.\n",
    "\n",
    "Hugging Face provides a conversion tool as part of the transformers repository to easily export quantized models to ONNX Runtime. For more information, please refer to the following: \n",
    "\n",
    "- [Hugging Face Documentation on ONNX Runtime quantization supports](https://huggingface.co/transformers/master/serialization.html#quantization)\n",
    "- [Intel's Explanation of Quantization](https://nervanasystems.github.io/distiller/quantization.html)\n",
    "\n",
    "With this method, the accuracy of the model remains at the same level than the full-precision model. If you want to see benchmarks on model performances, we recommand reading the [ONNX Runtime notebook](https://github.com/microsoft/onnxruntime/blob/master/onnxruntime/python/tools/quantization/notebooks/Bert-GLUE_OnnxRuntime_quantization.ipynb) on the subject."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Benchmarking PyTorch quantized model"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import torch \n",
    "\n",
    "# Quantize\n",
    "model_pt_quantized = torch.quantization.quantize_dynamic(\n",
    "    model_pt.to(\"cpu\"), {torch.nn.Linear}, dtype=torch.qint8\n",
    ")\n",
    "\n",
    "# Warm up \n",
    "model_pt_quantized(**model_inputs)\n",
    "\n",
    "# Benchmark PyTorch quantized model\n",
    "time_buffer = []\n",
    "for _ in trange(100):\n",
    "    with track_infer_time(time_buffer):\n",
    "        model_pt_quantized(**model_inputs)\n",
    "    \n",
    "results[\"PyTorch CPU Quantized\"] = OnnxInferenceResult(\n",
    "    time_buffer,\n",
    "    None\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Benchmarking ONNX quantized model"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from transformers.convert_graph_to_onnx import quantize\n",
    "\n",
    "# Transformers allow you to easily convert float32 model to quantized int8 with ONNX Runtime\n",
    "quantized_model_path = quantize(Path(\"bert.opt.onnx\"))\n",
    "\n",
    "# Then you just have to load through ONNX runtime as you would normally do\n",
    "quantized_model = create_model_for_provider(quantized_model_path.as_posix(), \"CPUExecutionProvider\")\n",
    "\n",
    "# Warm up the overall model to have a fair comparaison\n",
    "outputs = quantized_model.run(None, inputs_onnx)\n",
    "\n",
    "# Evaluate performances\n",
    "time_buffer = []\n",
    "for _ in trange(100, desc=f\"Tracking inference time on CPUExecutionProvider with quantized model\"):\n",
    "    with track_infer_time(time_buffer):\n",
    "        outputs = quantized_model.run(None, inputs_onnx)\n",
    "\n",
    "# Store the result\n",
    "results[\"ONNX CPU Quantized\"] = OnnxInferenceResult(\n",
    "    time_buffer, \n",
    "    quantized_model_path\n",
    ") "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Show the inference performance of each providers "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/",
     "height": 676
    },
    "colab_type": "code",
    "id": "dj-rS8AcqRZQ",
    "outputId": "b4bf07d1-a7b4-4eff-e6bd-d5d424fd17fb"
   },
   "outputs": [],
   "source": [
    "%matplotlib inline\n",
    "\n",
    "import matplotlib\n",
    "import matplotlib.pyplot as plt\n",
    "import numpy as np\n",
    "import os\n",
    "\n",
    "\n",
    "# Compute average inference time + std\n",
    "time_results = {k: np.mean(v.model_inference_time) * 1e3 for k, v in results.items()}\n",
    "time_results_std = np.std([v.model_inference_time for v in results.values()]) * 1000\n",
    "\n",
    "plt.rcdefaults()\n",
    "fig, ax = plt.subplots(figsize=(16, 12))\n",
    "ax.set_ylabel(\"Avg Inference time (ms)\")\n",
    "ax.set_title(\"Average inference time (ms) for each provider\")\n",
    "ax.bar(time_results.keys(), time_results.values(), yerr=time_results_std)\n",
    "plt.show()"
   ]
  }
 ],
 "metadata": {
  "colab": {
   "name": "ONNX Export",
   "provenance": []
  },
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.10.8"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 1
}
